1 Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.

library(fivethirtyeight)
data(drinks)

What are the variable types? Any missing values we should worry about?

glimpse(drinks)
## Rows: 193
## Columns: 5
## $ country                      <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings                <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings              <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings                <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
#The variable types can be seen as below. For country it is character, for beer_serving, spirit_servings and wine_servings they are integer, for total_litres_of_pure_alcohol it is double.
skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁
# No missing values as seen in the result below in column "n_missing".

Make a plot that shows the top 25 beer consuming countries

# 
top_25_beer <- drinks %>%
  top_n(25,beer_servings)

ggplot(top_25_beer,aes(y=fct_reorder(country,beer_servings),x=beer_servings))+
  geom_col(fill="green")+
  labs(title = "Top 25 beer consuming countries",x = "Beer servings", y = "Country")+
  NULL

Make a plot that shows the top 25 wine consuming countries

top_25_wine <- drinks %>%
  top_n(25,wine_servings)

ggplot(top_25_wine,aes(y=fct_reorder(country,wine_servings),x=wine_servings))+
  geom_col(fill="red")+
  labs(title = "Top 25 wine consuming countries",x = "Wine servings", y = "Country")+
  NULL

Finally, make a plot that shows the top 25 spirit consuming countries

top_25_spirit <- drinks %>%
  top_n(25,spirit_servings)

ggplot(top_25_spirit,aes(y=fct_reorder(country,spirit_servings),x=spirit_servings))+
  geom_col(fill="blue")+
  labs(title = "Top 25 spirit consuming countries",x = "Spirit servings", y = "Country")+
  NULL

What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).

Namibia has the highest beer consumption, which is unexpected. This maybe because it’s a former German colony, in which beer culture could have been established historically, leading to a higher beer consumption. France has the highest wine consumption, which is expected. It is a culture in France to associate every meal with a wine-pairing, and France is the second largest wine producing country in the world. Portugal has the second highest wine consumption, which is also expected, given its tradition for producing and consuming fortified wines, such as Port and Madeira. Grenada has the highest spirit consumption. This is probably due to the fact that spirits make up 66% of all alcohol consumption in this country.

2 Analysis of movies- IMDB dataset

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

2.1 Use your data import, inspection, and cleaning skills to answer the following:

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁
# There is no missing values as seen in the result below. Not all the entries are distinct as the n_unique values do not match, indicating that several values were repeated. Based on the number of rows 2961 and n_unique 2907, we can conclude that 54 rows are duplicated.

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of facebook likes cast memebrs received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating
unique_movies <- movies %>%
  distinct(title,genre,director,year,.keep_all=TRUE)
new_unique_movies <- movies %>%
  summarise("number of unique movies"=n()) 
  • Produce a table with the count of movies by genre, ranked in descending order
count_movie_genre <- unique_movies %>%
  group_by(genre) %>%
  count(sort=TRUE)

colnames(count_movie_genre) <- c("Genre", "Number of Movies")

count_movie_genre %>% 
  kable()
Genre Number of Movies
Comedy 844
Action 719
Drama 484
Adventure 281
Crime 198
Biography 135
Horror 128
Animation 35
Fantasy 26
Documentary 25
Mystery 15
Sci-Fi 7
Family 3
Musical 2
Romance 2
Western 2
Thriller 1
  • Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending order
return_movie_genre <- unique_movies %>%
  group_by(genre) %>% 
  summarise(mean_gross = mean(gross),mean_budget = mean(budget)) %>% 
  mutate(return_on_budget = 
           ((mean_gross - mean_budget)/mean_budget)) %>% 
  arrange(-return_on_budget)

colnames(return_movie_genre) <- c("Genre", "Mean gross [$]", "Mean budget [$]", "Return on budget")
  
return_movie_genre %>% 
  kable()
Genre Mean gross [$] Mean budget [$] Return on budget
Musical 9.21e+07 3189500 27.871
Family 1.49e+08 14833333 9.056
Western 2.08e+07 3465000 5.009
Documentary 1.74e+07 5887852 1.947
Horror 3.78e+07 13804379 1.737
Fantasy 4.19e+07 18484615 1.267
Comedy 4.25e+07 24458506 0.737
Mystery 6.91e+07 41500000 0.665
Animation 9.84e+07 61701429 0.595
Biography 4.52e+07 28543696 0.584
Adventure 9.44e+07 64692313 0.458
Drama 3.68e+07 25832605 0.423
Crime 3.76e+07 26527405 0.417
Romance 3.13e+07 25107500 0.245
Action 8.63e+07 70774558 0.219
Sci-Fi 2.98e+07 27607143 0.079
Thriller 2.47e+03 300000 -0.992
  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
top_15_directors <- unique_movies %>%
  select(director, gross) %>% 
  group_by(director) %>% 
  summarise(total_gross = sum(gross), 
            mean_gross = mean(gross), 
            median_gross = median(gross), 
            sd_gross = sd(gross)) %>% 
  top_n(15,total_gross) %>% 
  arrange(-total_gross)

colnames(top_15_directors) <- c("Director", "Total gross [$]", "Mean gross [$]", "Median gross [$]", "Standard deviation of gross [$]")
  
top_15_directors %>% 
  kable()
Director Total gross [$] Mean gross [$] Median gross [$] Standard deviation of gross [$]
Steven Spielberg 4.01e+09 1.75e+08 1.64e+08 1.01e+08
Michael Bay 2.20e+09 1.83e+08 1.68e+08 1.26e+08
James Cameron 1.91e+09 3.18e+08 1.76e+08 3.09e+08
Christopher Nolan 1.81e+09 2.27e+08 1.97e+08 1.87e+08
George Lucas 1.74e+09 3.48e+08 3.80e+08 1.46e+08
Robert Zemeckis 1.62e+09 1.25e+08 1.01e+08 9.13e+07
Tim Burton 1.56e+09 1.11e+08 6.98e+07 9.93e+07
Sam Raimi 1.44e+09 1.80e+08 1.38e+08 1.75e+08
Clint Eastwood 1.38e+09 7.25e+07 4.67e+07 7.55e+07
Francis Lawrence 1.36e+09 2.72e+08 2.82e+08 1.35e+08
Ron Howard 1.34e+09 1.11e+08 1.02e+08 8.19e+07
Gore Verbinski 1.33e+09 1.90e+08 1.23e+08 1.54e+08
Andrew Adamson 1.14e+09 2.84e+08 2.80e+08 1.21e+08
Shawn Levy 1.13e+09 1.03e+08 8.55e+07 6.55e+07
Ridley Scott 1.13e+09 8.06e+07 4.78e+07 6.88e+07
  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
dist_by_genre <- unique_movies %>%
  select(genre, rating) %>% 
  group_by(genre) %>% 
  summarise(mean_rating = mean(rating), 
            min_rating = min(rating), 
            max_rating = max(rating), 
            sd_rating = sd(rating)) 

colnames(dist_by_genre) <- c("Genre", "Mean rating", "Min rating", "Max rating", "Standard deviation of rating")


dist_by_genre %>% 
  kable()
Genre Mean rating Min rating Max rating Standard deviation of rating
Action 6.23 2.1 9.0 1.039
Adventure 6.51 2.3 8.6 1.106
Animation 6.65 4.5 8.0 0.968
Biography 7.11 4.5 8.9 0.760
Comedy 6.11 1.9 8.8 1.024
Crime 6.92 4.8 9.3 0.853
Documentary 6.66 1.6 8.5 1.767
Drama 6.74 2.1 8.8 0.915
Family 6.50 5.7 7.9 1.217
Fantasy 6.08 4.3 7.9 0.953
Horror 5.79 3.6 8.5 0.987
Musical 6.75 6.3 7.2 0.636
Mystery 6.84 4.6 8.5 0.910
Romance 6.65 6.2 7.1 0.636
Sci-Fi 6.66 5.0 8.2 1.094
Thriller 4.80 4.8 4.8 NA
Western 5.70 4.1 7.3 2.263
unique_movies %>%
  ggplot(aes(x=rating)) + 
  geom_histogram() +
  facet_wrap(vars(genre)) +
  labs(title = "Rating statistics of each genre", x = "Rating", y = "Count")

  NULL
## NULL

2.2 Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
unique_movies %>%
  ggplot(aes(y=gross, x=cast_facebook_likes)) + 
  geom_point() +
  labs(title = "Gross vs. Cast facebook likes", x = "Cast Facebook Likes", y = "Gross [$]")+
  NULL

#It's unlikely that number of facebook likes the cast received will be a good predictor of how much money a movie will make. X-axis is the independent variable "cast_facebook_likes", Y-axis is the dependent variable "gross". There is no obvious correlation between the two variables.
  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
unique_movies %>%
  ggplot(aes(y=gross, x=budget)) + 
  geom_point() +
  labs(title = "Gross vs. Budget", x = "Budget [$]", y = "Gross [$]")+
  NULL

#It can be inferred from the plot that budget is a weak predictor of how much money a movie will make. There is a lot of variation and scatter in the data thus the correlation is very weak.
  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
unique_movies %>%
  ggplot(aes(y=gross, x=rating)) + 
  geom_point() +
  facet_wrap(vars(genre))+
  labs(title = "Gross vs. the rating of the movie genre", x = "Rating", y = "Gross [$]")+
  NULL

#In the genre where there are enough data instances, the rating can be a good predictor for how much money a movie will make at the box office. This is expected because movies that are more popular are more likely to make more money. In the Action and Adventure genres, where there are lots of data points, the trend seems to be exponential. In the genres with few data points, it's hard to determine what the relationship is between the two variables.

3 Returns of financial stocks

You may find useful the material on finance data sources.

nyse <- read_csv(here::here("data","nyse.csv"))

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

companies_per_sector <- nyse %>%
  group_by(sector) %>%
  count(sort=TRUE)

colnames(companies_per_sector) <- c("Sector", "Number of Companies")
  
companies_per_sector %>% 
  kable()
Sector Number of Companies
Finance 97
Consumer Services 79
Public Utilities 60
Capital Goods 45
Health Care 45
Energy 42
Technology 40
Basic Industries 39
Consumer Non-Durables 31
Miscellaneous 12
Transportation 10
Consumer Durables 8
colnames(companies_per_sector) <- c("sector", "number_of_companies")

companies_per_sector %>% 
  ggplot(aes(reorder(sector,-number_of_companies), number_of_companies)) +
  geom_bar(stat="identity")+
  labs(title = "Number of companies in each sector", x = "Sector", y = "Number of Companies")+
  NULL

Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).

# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("AAPL","DIS","DPZ","ANF","TSLA","XOM","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2021-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 18,781
## Columns: 8
## Groups: symbol [7]
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high     <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low      <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close    <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume   <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.1, 10.2, 10.2, 10.2, 10.3, 10.5, 10.5, 10.6, 10.6, 10.7, 1…

Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

summarise_monthly_returns <- myStocks_returns_monthly %>%
  group_by(symbol) %>% 
  summarise(min_monthly_return = min(monthly_returns), 
            max_monthly_return = max(monthly_returns),
            median_monthly_return = median(monthly_returns), 
            mean_monthly_return = mean(monthly_returns), 
            sd_monthly_return = sd(monthly_returns)) 

colnames(summarise_monthly_returns) <- c("Stock","Minimum Monthly Return [$]", "Maximum Monthly Return [$]", "Median of Monthly Return [$]", "Mean of Monthly Return [$]", "Standard Deviation of Monthly Return [$]")
 

summarise_monthly_returns %>% 
  kable()
Stock Minimum Monthly Return [$] Maximum Monthly Return [$] Median of Monthly Return [$] Mean of Monthly Return [$] Standard Deviation of Monthly Return [$]
AAPL -0.181 0.217 0.026 0.024 0.078
ANF -0.421 0.507 0.003 0.009 0.145
DIS -0.179 0.234 0.009 0.016 0.068
DPZ -0.188 0.342 0.025 0.031 0.075
SPY -0.125 0.127 0.017 0.012 0.038
TSLA -0.224 0.811 0.015 0.052 0.176
XOM -0.262 0.233 0.002 0.003 0.066

Plot a density plot, using geom_density(), for each of the stocks

myStocks_returns_monthly %>%
  ggplot(aes(x=monthly_returns)) + 
  geom_density() +
  facet_wrap(vars(symbol))+
  labs(title = "Monthly return of each of the stocks", x = "Monthly return [$]", y = "Count")+
  NULL

What can you infer from this plot? Which stock is the riskiest? The least risky?

The risk of the stock can be determined by visually analyzing the proportion of the graph area that lies in the negative direction of the x-axis. This determines how the stocks were performing and whether they were likely to produce negative returns. Based on this, the riskiest stock would be XOM, because its past performance has shown that it often gives negative returns. The least risky stock would be DPZ, because its past performance indicates that it is likely to produce positive returns.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock

library(ggrepel)
myStocks_returns_monthly %>% 
  group_by(symbol) %>% 
  summarise(mean_monthly_return = mean(monthly_returns), 
            sd_monthly_return = sd(monthly_returns)) %>% 
  ggplot(aes(y=sd_monthly_return, x=mean_monthly_return, label = symbol)) + 
  geom_point() +
  geom_text_repel()+
  labs(title = "The risk of each of the stocks expressed as standard deviation vs. the monthly return", x = "Monthly return [$]", y = "Standard deviation [$]")+
  NULL

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

Based on this graph, ANF had a broad distribution of monthly returns but not a higher expected return when compared to other stocks. Therefore it is of high risk but not of high expected return.

4 On your own: IBM HR Analytics

For this task, you will analyse a data set on Human Resoruce Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.

First let us load the data

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

  1. How often do people leave the company (attrition)
leave_count <- hr_cleaned %>% 
  select(attrition) %>%
  group_by(attrition) %>% 
  summarise(attrition_count = count(attrition))

leave_count %>%
  ggplot(aes(x=attrition,y=attrition_count)) + 
  geom_col() +
  labs(title = "Number of people who leave the company expressed as attrition", x = "Attrition", y = "Count")+
  NULL

leave_count[2,2]/sum(leave_count$attrition_count)
##   attrition_count
## 1           0.161
# 16.1% of the people in this survey leave the company.
  1. How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
employee_info <- hr_cleaned %>% 
  select(age, years_at_company, monthly_income, years_since_last_promotion) 

age_plot <- ggplot(employee_info, aes(x=age)) + 
  geom_density() +
  labs(title = "Density plot of employees' age", x = "Age", y = "Density")+
  NULL

years_plot <- ggplot(employee_info, aes(x=years_at_company)) + 
  geom_density() +
  labs(title = "Density plot of how long employees have stayed in the company", x = "Years at company", y = "Density")+
  NULL

income_plot <- ggplot(employee_info, aes(x=monthly_income)) + 
  geom_density() +
  labs(title = "Density plot of employees' monthly income", x = "Monthly income", y = "Density")+
  NULL

promotion_plot <- ggplot(employee_info, aes(x=years_since_last_promotion)) + 
  geom_density() +
  labs(title = "Density plot of how long it has been since employees' are last promoted", x = "Years since last promotion", y = "Density")+
  NULL

ggarrange(age_plot,years_plot,income_plot,promotion_plot, ncol=2, nrow=2)

#By looking at the above plots, the "age_plot" is closer to normal distribution.
  1. How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total
job_satis_dist <- hr_cleaned %>% 
  select(job_satisfaction) %>%
  group_by(job_satisfaction) %>%
  summarise(cnt = n()) %>%
  mutate(freq = percent(cnt / sum(cnt)))
  

  
job_satis_dist %>% 
  ggplot(aes(reorder(job_satisfaction,-cnt), cnt)) +
  geom_bar(stat="identity")+
  geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
  labs(title = "Distribution of job satisfaction", x="Job satisfaction", y="Count")+
  NULL 

work_life_dist <- hr_cleaned %>% 
  select(work_life_balance) %>%
  group_by(work_life_balance) %>%
  summarise(cnt = n()) %>%
  mutate(freq = percent(cnt / sum(cnt)))
  

  
work_life_dist %>% 
  ggplot(aes(reorder(work_life_balance,-cnt), cnt)) +
  geom_bar(stat="identity")+
  geom_text(aes(label = freq), vjust = 1.5, colour = "white")+
  labs(title = "Distribution of work life balance", x="Work life balance", y="Count")+
  NULL 

  1. Is there any relationship between monthly income and education? Monthly income and gender?
hr_cleaned %>% 
  select(education,monthly_income) %>% 
  ggplot(aes(x=education, y=monthly_income)) +
  geom_boxplot()+
  labs(title = "Relationship between monthly income and education", x = "Education", y = "Monthly income") +
  NULL

income_gender <- hr_cleaned %>% 
  group_by(gender) %>% 
  ggplot(aes(x=gender, y=monthly_income)) +
  geom_boxplot()+
  labs(title = "Relationship between monthly income and gender", x = "Gender", y = "Monthly income") +
  NULL

income_gender

  1. Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first
hr_cleaned %>% 
  mutate(job_role = fct_rev(fct_reorder(job_role, monthly_income, median))) %>% 
  ggplot(aes(x=job_role, y=monthly_income)) +
  geom_boxplot()+
  labs(title = "Relationship between monthly income and job role", x = "Job role", y = "Mean monthly income") +
  NULL

  1. Calculate and plot a bar chart of the mean (or median?) income by education level.
income_edu <- hr_cleaned %>% 
  group_by(education) %>% 
  summarise(mean_income = mean(monthly_income)) %>% 
  ggplot(aes(x=education, y=mean_income)) +
  geom_col()+
  labs(title = "Relationship between monthly income and education", x = "Education", y = "Mean monthly income") +
  NULL
  1. Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes
hr_cleaned %>% 
  ggplot(aes(x=monthly_income)) +
  ggthemes::theme_wsj() +
  geom_density() +
  facet_wrap(vars(education)) +
  labs(title = "Distribution of monthly 
income by education level", x = "Education", y = "Density") +
  NULL

  1. Plot income vs age, faceted by job_role
hr_cleaned %>% 
  ggplot(aes(y=monthly_income, x=age)) +
  geom_point() +
  facet_wrap(vars(job_role))+
  labs(title = "Relationship between income and age", x = "Age", y = "Monthly income") +
  NULL

5 Challenge 1: Replicating a chart

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.

First we look at the data if there’s missing values.

skim(election2020_results)
Data summary
Name election2020_results
Number of rows 22093
Number of columns 12
_______________________
Column type frequency:
character 8
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 51 0
state_po 0 1 2 2 0 51 0
county_name 0 1 3 21 0 1865 0
fips 9 1 5 5 0 3153 0
office 0 1 9 12 0 2 0
candidate 0 1 5 17 0 4 0
party 0 1 5 11 0 5 0
mode 0 1 4 20 0 16 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2020 0 2020 2020 2020 2020 2020 ▁▁▇▁▁
candidatevotes 1 1 7173 42426 0 15 177 2341 3028885 ▇▁▁▁▁
totalvotes 5 1 50118 140287 66 6190 13669 36638 4264365 ▇▁▁▁▁
version 0 1 20210622 0 20210622 20210622 20210622 20210622 20210622 ▁▁▇▁▁
#We can see that there are 9 missing values. 

We need to clean up the missing values.

#remove missing values
election2020_results <- na.omit(election2020_results)

Then, calculate the rate of votes for Trump in each county.

vote_data <- election2020_results %>% 
  select(county_name, fips, candidate, candidatevotes, totalvotes) %>% 
  filter(candidate == "DONALD J TRUMP") %>% 
  group_by(county_name, fips, totalvotes) %>% 
  summarise(vote_sum = sum(candidatevotes)) %>% 
  mutate(vote_rate = vote_sum/totalvotes)

Then, filter the most recent data from vaccinations, in this case, september data.

sept_vaccinations <- vaccinations %>% 
  select(date, fips,series_complete_pop_pct) %>% 
  filter(date == "09/04/2021")

#After we filtered the dataset, we can conduct cleaning so that the code is more efficient regarding to a smaller dataset.
sept_vaccinations <- na.omit(sept_vaccinations) %>% 
  unique()

Next, join the september vaccination percentage to the percentage of votes for each county by each candidate.

pct_trump_votes_vacc <- inner_join (vote_data, 
                               sept_vaccinations , 
                               by="fips")

Next, join the population size for each county into the percentage trump votes vaccination table.

votes_vacc_pop <- inner_join (pct_trump_votes_vacc, 
                               population , 
                               by="fips")

Then, plot the relationship between percentage of trump vote and percentage of population vaccinated.

vac_pop_plot <- votes_vacc_pop %>% 
  ggplot(aes(x=vote_rate,y=series_complete_pop_pct/100, size=pop_estimate_2019)) + 
  
  #fit a regression line to the plot
  geom_smooth(method=lm) + 
  
  #aethetics for population size
  geom_point(color = "black", alpha = 0.2) +
  geom_point(size = 0.3)+
  scale_size(range = c(0, 30)) +
  
  #draw the dash lines
  geom_hline(yintercept=0.85, linetype="dashed") +
  geom_hline(yintercept=0.539, linetype="dashed") +
  geom_hline(yintercept=0.508, linetype="dashed") +
  
  #break the graph into 3x3 grids with solid lines
  geom_hline(yintercept=0.4, linetype="solid") +
  geom_hline(yintercept=0.6, linetype="solid") +
  geom_vline(xintercept=0.4, linetype="solid") +
  geom_vline(xintercept=0.6, linetype="solid") +
  labs(title = "COVID 19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY", 
       x = "2020 Trump Vote %", 
       y = "% of Total Population Vaccinated") +
  NULL

#add the text messages
vac_pop_plot + annotate("text", x = 0.2,y = 0.89, label = "Herd Immunity Threshold (?)",color = 'blue',size = 3) +
  annotate("text", x = 0.05, y = 0.56, label = "TARGET:53.9%", fontface = 'italic', color = 'blue',size = 3) +
  annotate("text", x = 0.05, y = 0.525, label = "ACTUAL:50.8%", fontface = 'italic', color = 'blue',size = 3)+
  annotate("text", x = 0.3, y = 0.1, label = "09/04/2021", fontface = 'italic', color = 'red') +
  annotate("text", x = 0.45, y = 1, label = "EVERY U.S. COUNTY", size = 6) +
  annotate("text", x = 0.1, y = 0.06, label = deparse(bquote(~R^2 ==~ 0.4216)),
            parse = T, size = 3, color = 'red')+
  annotate("text", x = 0.1, y = 0.1, label = "y = -0.4224x + 0.6451", color = 'red',size = 3) +
  scale_y_continuous(labels = scales::percent)+
  scale_x_continuous(labels = scales::percent, limits = c(0,1))+
  theme(legend.position="none")

6 Challenge 2: Opinion polls for the 2021 German elections

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.

The following code will scrape the wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  # drop the first row, as it contains again the variable names and last row that contains 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )
glimpse(german_election_polls)
## Rows: 214
## Columns: 16
## $ polling_firm   <chr> "INSA", "Forschungsgruppe Wahlen", "Infratest dimap", "…
## $ fieldwork_date <chr> "30 Aug – 3 Sep 2021", "31 Aug – 2 Sep 2021", "30 Aug –…
## $ samplesize     <chr> "1,427", "1,301", "1,337", "2,340", "1,729", "1,439", "…
## $ abs            <chr> "–", "29", "–", "–", "–", "–", "–", "24", "–", "–", "27…
## $ union          <dbl> 20.0, 22.0, 20.0, 19.5, 20.0, 21.0, 20.0, 21.0, 21.0, 2…
## $ spd            <dbl> 25.0, 25.0, 25.0, 27.0, 25.0, 25.0, 25.0, 23.0, 25.0, 2…
## $ af_d           <dbl> 12.0, 11.0, 12.0, 10.5, 12.0, 11.0, 11.0, 11.0, 11.0, 1…
## $ fdp            <dbl> 13.0, 11.0, 13.0, 13.0, 13.0, 11.0, 13.5, 12.0, 11.0, 1…
## $ linke          <dbl> 7.0, 7.0, 6.0, 6.5, 8.0, 7.0, 7.0, 6.0, 7.0, 6.0, 6.0, …
## $ grune          <dbl> 16.0, 17.0, 16.0, 15.5, 15.0, 19.0, 16.5, 18.0, 19.0, 1…
## $ fw             <chr> "–", "–", "–", "–", "–", "–", "–", "–", "–", "–", "3", …
## $ others         <chr> "7", "7", "8", "8", "8", "6", "7", "9", "6", "8", "6", …
## $ lead           <chr> "5", "3", "5", "7.5", "5", "4", "5", "2", "4", "3", "Ti…
## $ end_date       <date> 2021-09-03, 2021-09-02, 2021-09-01, 2021-08-31, 2021-0…
## $ month          <dbl> 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
## $ week           <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 34, 34, 34, 34, 34, 34,…
#selecting the columns that are useful
selected_data <- german_election_polls %>% 
  select (samplesize, union, spd, af_d, fdp, linke, grune, end_date)

#change "samplesize" to double
selected_data$samplesize <- as.numeric(gsub(",","",selected_data$samplesize)) 

election_data_raw <- selected_data %>% 
  mutate (number_of_vote_union = samplesize*union/100, 
          number_of_vote_spd = samplesize*spd/100, 
          number_of_vote_af_d = samplesize*af_d/100, 
          number_of_vote_fdp = samplesize*fdp/100, 
          number_of_vote_linke = samplesize*linke/100, 
          number_of_vote_grune = samplesize*grune/100) %>% 
  
  #calculate the number of votes for each party (sample size*percenrage of vote)
  group_by(end_date) %>% 
  mutate (totalsample = sum(samplesize)) %>% 
  
  #the total sample size of the result given by each company on the same date 
  mutate (sum_vote_union = sum(number_of_vote_union), 
          sum_vote_spd = sum(number_of_vote_spd), 
          sum_vote_af_d = sum(number_of_vote_af_d), 
          sum_vote_fdp = sum(number_of_vote_fdp), 
          sum_vote_linke = sum(number_of_vote_linke), 
          sum_vote_grune = sum(number_of_vote_grune)) %>% 
  
    #calculate the average percentage of vote by using the total number of votes for the party / the total sample size
  mutate (average_rate_union = sum_vote_union/totalsample, 
          average_rate_spd = sum_vote_spd/totalsample, 
          average_rate_af_d = sum_vote_af_d/totalsample, 
          average_rate_fdp = sum_vote_fdp/totalsample, 
          average_rate_linke = sum_vote_linke/totalsample, 
          average_rate_grune = sum_vote_grune/totalsample)

#remove the duplicated rows for the same date
election_data <- election_data_raw[!duplicated(election_data_raw$totalsample), ]
election_data = na.omit(election_data)

roll_mean_union = zoo::rollmean(election_data$average_rate_union,k = 14, fill = NA, align = "left")

roll_mean_spd = zoo::rollmean(election_data$average_rate_spd,k = 14, fill = NA, align = "left")

roll_mean_af_d = zoo::rollmean(election_data$average_rate_af_d,k = 14, fill = NA, align = "left")

roll_mean_fdp = zoo::rollmean(election_data$average_rate_fdp,k = 14, fill = NA, align = "left")

roll_mean_linke = zoo::rollmean(election_data$average_rate_linke,k = 14, fill = NA, align = "left")

roll_mean_grune= zoo::rollmean(election_data$average_rate_grune,k = 14, fill = NA, align = "left")
election_data = cbind(election_data,
                      roll_mean_union=roll_mean_union,
                      roll_mean_spd=roll_mean_spd,
                      roll_mean_af_d=roll_mean_af_d,
                      roll_mean_fdp=roll_mean_fdp,
                      roll_mean_linke=roll_mean_linke,
                      roll_mean_grune=roll_mean_grune)
colors = c("CDU/CSU" = "black", "SPD" = "red3", "AfD" = "midnightblue", "FDP" = "orange1",
          "Linke" = "mediumvioletred", "Grune" = "green")

election_data %>% 
  ggplot(aes(x = end_date))+
  geom_point(aes(y = average_rate_union), alpha=0.3, size=1) + 
  geom_point(aes(y = average_rate_spd), color="red3", alpha=0.3, size=1) +
  geom_point(aes(y = average_rate_af_d), color="midnightblue", alpha=0.3, size=1)+
  geom_point(aes(y = average_rate_fdp), color="orange1", alpha=0.3, size=1)+
  geom_point(aes(y = average_rate_linke), color="mediumvioletred", alpha=0.3, size=1)+
  geom_point(aes(y = average_rate_grune), color="green", alpha=0.3, size=1)+
  geom_line(aes(y = roll_mean_union, color="CDU/CSU"), size=1) + 
  geom_line(aes(y = roll_mean_spd, color="SPD"),size=1) +
  geom_line(aes(y = roll_mean_af_d, color="AfD"),size=1)+
  geom_line(aes(y = roll_mean_fdp, color="FDP"),size=1)+
  geom_line(aes(y = roll_mean_linke, color="Linke"),size=1)+
  geom_line(aes(y = roll_mean_grune,  color="Grune"),size=1)+
  labs(title = "German Election Poll", x="Date", y="Percentage of vote")+
  scale_y_continuous(labels = scales::percent)+
  scale_x_date(labels = date_format("%B-%Y"))+
  theme(legend.position = "right",legend.background = element_rect(fill="snow", 
                                  size=0.5, linetype="solid")) +
  theme_bw()+
  scale_color_manual(values = colors, name = "Parties")

7 Details

  • Who did you collaborate with: Linli Ding, Yifan Yang, Clemens Scherf, Chandrima Tolia, Edoardo Ferri, Hamish Thomas
  • Approximately how much time did you spend on this problem set: 2 days
  • What, if anything, gave you the most trouble: Debugging

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else? Yeah